Electronics Store Sales Analysis

Advanced R Programming - Final Project

Author

Hein Htet SOE THAN

Published

February 9, 2026

0.1 Section 1: Introduction

0.1.1 Research Questions & Hypotheses

To evaluate the performance of the electronics store, this project will address the following questions:

Product Hierarchy: Which product categories contribute most to the total revenue, and does a higher unit price correlate with higher total sales volume?

Hypothesis: While premium items (e.g., MacBooks) generate high revenue per sale, mid-tier accessories (e.g., USB-C Charging Cables) will have the highest frequency of transactions, driving consistent cash flow.

Temporal & Geographic Patterns: Are there specific “peak hours” for electronics shopping that differ across major US cities?

Hypothesis: Tech-heavy hubs like San Francisco will show higher purchase volumes during late-evening hours compared to other metropolitan areas.

0.1.2 Dataset Description

The dataset used in this analysis represents 12 months of sales data from a multi-city electronics retailer.

Source: The data is sourced from the Kaggle Electronics Store Dataset.

Size: It contains approximately 186,000 rows and 6 initial columns.

Content: It includes variables such as Order ID, Product, Quantity Ordered, Price Each, Order Date, and Purchase Address.

0.1.3 Relevance

This dataset is critical for answering my research questions because it provides the granularity required for both temporal (time-based) and categorical analysis. The inclusion of timestamped orders allows for the application of lubridate to extract seasonal trends, while the product-level pricing enables a deep dive into revenue distribution using dplyr’s .by argument.
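As a quick sketch of those two tools working together on a hypothetical mini-tibble (made-up rows, not the real dataset):

```r
# Toy data for illustration only (hypothetical products and timestamps)
library(dplyr)
library(lubridate)

toy <- tibble(
  product    = c("Cable", "Cable", "Laptop"),
  revenue    = c(12, 15, 1700),
  order_date = mdy_hm(c("04/19/19 08:46", "04/19/19 21:30", "12/24/19 10:00"))
)

# lubridate pulls temporal features out of the timestamp...
toy <- toy |> mutate(order_hour = hour(order_date))

# ...and dplyr's .by argument produces a per-group summary without group_by()
per_product <- toy |>
  summarise(total_revenue = sum(revenue), .by = product)
```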

0.1.4 Analytical Roadmap

Exploration: Use skimr to identify data distributions and missingness.

Cleaning: Parse dates, split addresses into cities, and handle numeric conversions.

Analysis: Filter for high-value transactions and use grouped summaries to compare city-level performance.

Visualization: Create a dashboard of sales trends using patchwork and plotly to identify peak periods visually.

0.1.5 Loading Libraries and Data

# Loading necessary libraries for the project
library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate) 
library(patchwork) 
library(plotly)    

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(DT)        
library(skimr)     
library(scales)    

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(knitr)
# Installing webshot2 so interactive widgets (plotly, DT) render in the PDF.
# Guarded with requireNamespace() and an explicit CRAN mirror so knitting
# does not fail with "trying to use CRAN without setting a mirror"
if (!requireNamespace("webshot2", quietly = TRUE)) {
  install.packages("webshot2", repos = "https://cloud.r-project.org")
}
# For PDF export via LaTeX
if (!requireNamespace("tinytex", quietly = TRUE)) {
  install.packages("tinytex", repos = "https://cloud.r-project.org")
  tinytex::install_tinytex()
}

0.1.6 Why these libraries?

  • tidyverse: Provides a cohesive set of tools for data manipulation (dplyr), visualization (ggplot2), and data tidying (tidyr). Its consistent syntax allows for efficient data wrangling and analysis.
  • lubridate: Simplifies the parsing and manipulation of date-time data, which is crucial for analyzing temporal patterns in sales.
  • patchwork: Enables the combination of multiple ggplot2 plots into a single cohesive dashboard, facilitating comparative analysis across different dimensions.
  • plotly: Allows for the creation of interactive visualizations that can enhance user engagement and provide deeper insights through hover information and dynamic filtering.
  • DT: Provides an interactive interface for displaying data tables, making it easier to explore and understand the dataset.
  • skimr: Offers a comprehensive overview of the dataset, including distributions, missing values, and summary statistics, which is essential for the initial data exploration phase.
  • scales: Useful for formatting axes and labels, particularly when dealing with financial data, to improve readability and presentation quality.
  • knitr: Powers the rendering of this report and provides kable() for turning data frames into clean, publication-ready tables.
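As a minimal illustration of how two of these fit together (using R's built-in mtcars data rather than the sales data):

```r
# Illustration on built-in mtcars data (not the sales dataset)
library(ggplot2)
library(patchwork)
library(scales)

# scales formats an axis as currency, useful for revenue plots
p1 <- ggplot(mtcars, aes(wt, mpg * 1000)) +
  geom_point() +
  scale_y_continuous(labels = label_dollar())

p2 <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar()

# patchwork overloads `+` so two ggplots become one side-by-side dashboard
dashboard <- p1 + p2
```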

0.2 Section 2: Data Exploration

In this section, I import the raw sales data and perform an initial inspection to understand the variable distributions and identify potential data quality issues.

0.2.1 Merging the Monthly Datasets

I will consolidate the 12 monthly CSV files into a single primary data frame using a functional programming approach. During the import phase, I will use lubridate to parse the Order Date and create a dedicated Month identifier. This consolidation is essential for performing longitudinal analysis and ensures that cross-month trends—such as seasonal spikes—are captured accurately in a single, unified tibble.

Note

I uploaded the raw dataset to GitHub.

# Taking the data from GitHub
if (!dir.exists("Monthly_Sales_Data")) {
  dir.create("Monthly_Sales_Data")
  
  # download.file() needs the repository's ZIP archive, not the .git URL
  # (branch name assumed to be "main"; adjust if the repo uses another)
  data_url <- "https://github.com/Hein96/Electronic-Sales-Data/archive/refs/heads/main.zip"
  dest_file <- "Monthly_Sales_Data.zip"
  
  download.file(data_url, dest_file, mode = "wb")
  unzip(dest_file, exdir = "Monthly_Sales_Data")
}
sales_data_mer <- list.files(path = "Monthly_Sales_Data/", pattern = "\\.csv$", full.names = TRUE) # pattern is a regex, so escape the dot
sales_raw <- sales_data_mer |> 
  map_df(~read_csv(.))
Rows: 18383 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Order ID, Product, Quantity Ordered, Price Each, Order Date, Purcha...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sales_raw) # Get a quick overview of the data structure
Rows: 186,850
Columns: 6
$ `Order ID`         <chr> "176558", NA, "176559", "176560", "176560", "176561…
$ Product            <chr> "USB-C Charging Cable", NA, "Bose SoundSport Headph…
$ `Quantity Ordered` <chr> "2", NA, "1", "1", "1", "1", "1", "1", "1", "1", "1…
$ `Price Each`       <chr> "11.95", NA, "99.99", "600", "11.99", "11.99", "11.…
$ `Order Date`       <chr> "04/19/19 08:46", NA, "04/07/19 22:30", "04/12/19 1…
$ `Purchase Address` <chr> "917 1st St, Dallas, TX 75001", NA, "682 Chestnut S…
cat("Number of rows:", nrow(sales_raw), "\n") 
Number of rows: 186850 
cat("Number of columns:", ncol(sales_raw), "\n")
Number of columns: 6 
head(sales_raw) # Display the first few rows of the dataset
# A tibble: 6 × 6
  `Order ID` Product                `Quantity Ordered` `Price Each` `Order Date`
  <chr>      <chr>                  <chr>              <chr>        <chr>       
1 176558     USB-C Charging Cable   2                  11.95        04/19/19 08…
2 <NA>       <NA>                   <NA>               <NA>         <NA>        
3 176559     Bose SoundSport Headp… 1                  99.99        04/07/19 22…
4 176560     Google Phone           1                  600          04/12/19 14…
5 176560     Wired Headphones       1                  11.99        04/12/19 14…
6 176561     Wired Headphones       1                  11.99        04/30/19 09…
# ℹ 1 more variable: `Purchase Address` <chr>
summary(sales_raw) # Get summary statistics for numeric columns
   Order ID           Product          Quantity Ordered    Price Each       
 Length:186850      Length:186850      Length:186850      Length:186850     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
  Order Date        Purchase Address  
 Length:186850      Length:186850     
 Class :character   Class :character  
 Mode  :character   Mode  :character  
skimr::skim(sales_raw) # Get a comprehensive overview of the dataset
Data summary

  Name                sales_raw
  Number of rows      186850
  Number of columns   6
  Column type frequency: character (6)
  Group variables     None

Variable type: character

  skim_variable     n_missing complete_rate min max empty n_unique whitespace
  Order ID                545             1   6   8     0   178438          0
  Product                 545             1   6  26     0       20          0
  Quantity Ordered        545             1   1  16     0       10          0
  Price Each              545             1   3  10     0       24          0
  Order Date              545             1  10  14     0   142396          0
  Purchase Address        545             1  16  42     0   140788          0
sales_data |> 
  summarise(
    mean_price = mean(`Price Each`, na.rm = TRUE),
    median_price = median(`Price Each`, na.rm = TRUE),
    sd_price = sd(`Price Each`, na.rm = TRUE),
    total_records = n()
  ) |> 
  knitr::kable(caption = "Custom Summary Statistics for Unit Price")
Error:
! object 'sales_data' not found

The chunk above fails for two reasons: I referenced sales_data, an object that does not exist at this point (the frame is named sales_raw), and the Price Each column is still stored as character, because the merged files contain repeated header rows (the literal text "Price Each") and other non-numeric entries. To address this, I convert Price Each to numeric, coercing any non-numeric values to NA.

sales_raw <- sales_raw |> 
  mutate(`Price Each` = as.numeric(`Price Each`)) # Convert to numeric, coercing non-numeric values to NA
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Price Each = as.numeric(`Price Each`)`.
Caused by warning:
! NAs introduced by coercion
sales_raw |> 
  summarise(
    mean_price = mean(`Price Each`, na.rm = TRUE),
    median_price = median(`Price Each`, na.rm = TRUE),
    sd_price = sd(`Price Each`, na.rm = TRUE),
    total_records = n()
  ) |> 
  knitr::kable(caption = "Custom Summary Statistics for Unit Price")
Custom Summary Statistics for Unit Price

  mean_price  median_price  sd_price  total_records
    184.3997         14.95  332.7313         186850

Now that Price Each is numeric, the summary statistics are meaningful: a mean of $184.40 against a median of $14.95 (sd $332.73) points to a strongly right-skewed price distribution, where most transactions involve cheap accessories and a long tail of expensive devices pulls the mean upward.

Note

Technical Choice: Data Consolidation While the raw data consists of 12 individual monthly files, I have chosen to perform the primary exploration on the consolidated dataset. This allows for a holistic view of the store’s annual performance and ensures that the summary statistics (mean, median, etc.) reflect the entire fiscal year rather than a single month.

0.2.2 Initial Observations & Anomalies

Based on the exploration above, I have identified several patterns and issues that must be addressed in the cleaning phase:

Data Type Mismatch: The Order Date is currently stored as a character string and needs conversion to a POSIXct object.

Missing Values: Preliminary checks via skim() indicate a small percentage of empty rows that need to be removed.

Column Naming: Several column names contain spaces (e.g., Order ID), which makes coding cumbersome. These will be renamed for consistency.

Logical Anomalies: Quantity Ordered is stored as a character type; it must be coerced to numeric to allow for revenue calculation.

0.3 Section 3: Data Cleaning

0.3.1 Initial Integrity Check

Before cleaning, I calculate the total number of missing values to understand the extent of data loss.

sum(is.na(sales_raw))
[1] 3625

0.3.2 The Cleaning Pipeline

I used a single piped workflow to clean the data. I chose to drop rows with missing Order IDs rather than imputing them, as an Order ID is a unique identifier that cannot be accurately guessed. For prices, I used grouped median imputation to maintain data integrity.

sales_cleaned <- sales_raw |>
  #| 1. Standardizing column names
  rename(
    order_id = `Order ID`,
    product = `Product`,
    quantity = `Quantity Ordered`,
    unit_price = `Price Each`,
    order_date = `Order Date`,
    purchase_address = `Purchase Address`
  ) |>
  #| 2. Filter out the repeating header rows 
  filter(product != "Product") |>
  #| 3. Fix data types and parse dates
  mutate(
    quantity = as.numeric(quantity),
    unit_price = as.numeric(unit_price),
    order_date = mdy_hm(order_date) # Parsing with lubridate
  ) |>
  #| 4. String cleaning: Extract City 
  #| I extract the text between the two commas in the address
  mutate(
    city = str_split_i(purchase_address, ",", 2) |> str_trim()
  ) |>
  #| 5. Grouped Imputation 
  mutate(
    unit_price = if_else(
      is.na(unit_price), 
      median(unit_price, na.rm = TRUE), 
      unit_price
    ),
    .by = product
  ) |>
  #| 6. Drop rows with missing Order IDs and in quantity
  drop_na(order_id, quantity)
sum(is.na(sales_cleaned))
[1] 0
head(sales_cleaned)
# A tibble: 6 × 7
  order_id product      quantity unit_price order_date          purchase_address
  <chr>    <chr>           <dbl>      <dbl> <dttm>              <chr>           
1 176558   USB-C Charg…        2       12.0 2019-04-19 08:46:00 917 1st St, Dal…
2 176559   Bose SoundS…        1      100.0 2019-04-07 22:30:00 682 Chestnut St…
3 176560   Google Phone        1      600   2019-04-12 14:38:00 669 Spruce St, …
4 176560   Wired Headp…        1       12.0 2019-04-12 14:38:00 669 Spruce St, …
5 176561   Wired Headp…        1       12.0 2019-04-30 09:27:00 333 8th St, Los…
6 176562   USB-C Charg…        1       12.0 2019-04-29 13:03:00 381 Wilson St, …
# ℹ 1 more variable: city <chr>


  Col_names   Data_type  Description                     Missing_values  Cleaning_methods
  order_id    character  Unique ID for each transaction  0               Renamed from "Order ID"
  product     character  Name of the electronics item    0               Removed repeated "Product" header rows
  quantity    numeric    Units sold per order            0               Coerced from character to numeric
  unit_price  numeric    Price per unit in USD           0               Median imputed by product
  order_date  POSIXct    Timestamp of purchase           0               Parsed with lubridate::mdy_hm()
  city        character  City extracted from address     0               Created with str_split_i()

0.3.3 Justification of Choices

Handling Headers: I used filter(product != "Product") because the 12-month merge duplicates the header row 11 times. Leaving these rows in would break the numeric conversion.

Imputation Strategy: I chose grouped median imputation for prices because electronics have fixed prices per SKU. A mean might be skewed by outliers, whereas the median for a specific product like “iPhone” is highly accurate.

Date Formatting: I converted order_date to a POSIXct object immediately to allow for time-series analysis.
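The grouped median rule can be illustrated on a hypothetical mini-frame (made-up prices, one missing value per product):

```r
library(dplyr)

# Toy prices (hypothetical): electronics have a fixed price per SKU,
# so the per-product median recovers the true price exactly
toy_prices <- tibble(
  product    = c("iPhone", "iPhone", "iPhone", "Cable", "Cable", "Cable"),
  unit_price = c(700, NA, 700, 11.95, 11.95, NA)
)

# Each NA is filled with the median of its own product group
imputed <- toy_prices |>
  mutate(
    unit_price = if_else(is.na(unit_price),
                         median(unit_price, na.rm = TRUE),
                         unit_price),
    .by = product
  )
```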

0.4 Section 4: Data Analysis

In this section, I perform a series of analytical operations to extract business insights from the consolidated electronics sales data.

0.4.1 Product Discovery

I will identify the most expensive items in the inventory to see where the “Premium” tier naturally begins.

# 1. List unique products and their prices, sorted by price
# I use .by to ensure we get the correct price for each item 
high_value_products <- sales_cleaned |>
  summarise(
    unit_price = max(unit_price), 
    .by = product
  ) |>
  arrange(desc(unit_price))

# 2. Display the top 10 most expensive products
# This justifies why we will later use the $600 threshold
high_value_products |> head(10)
# A tibble: 10 × 2
   product                unit_price
   <chr>                       <dbl>
 1 Macbook Pro Laptop          1700 
 2 ThinkPad Laptop             1000.
 3 iPhone                       700 
 4 Google Phone                 600 
 5 LG Dryer                     600 
 6 LG Washing Machine           600 
 7 Vareebadd Phone              400 
 8 27in 4K Gaming Monitor       390.
 9 34in Ultrawide Monitor       380.
10 Flatscreen TV                300 
Note

Filtering for high-value products shows that the Macbook Pro Laptop, ThinkPad Laptop and iPhone sit at the top of the price list, each above $600. Focusing on these products is useful for examining premium sales dynamics, such as revenue concentration and customer willingness to pay.

0.4.2 Global High-Value Analysis

Now that I have confirmed which products are expensive (MacBooks, ThinkPads and iPhones), I can filter for all high-value products priced above $600, regardless of brand. This global approach ensures the analysis captures every high-end item rather than a single product line. I then assign each row to a market segment by price, with above $600 as "Premium", above $150 as "Mid-Range", and $150 or below as "Budget", and calculate the total revenue and order volume for each segment to see which tier drives the store's financial performance.

# Numeric filter for all products > 600
# This creates a 'Premium Segment' dataset
high_value_segment <- sales_cleaned |>
  filter(unit_price > 600)

# Use case_when to categorize the WHOLE dataset
# This adds context to every row (Above & Beyond categorization)
sales_analysis <- sales_cleaned |>
  mutate(month = month(order_date, label = TRUE, abbr = TRUE), # Extracts the month name from the order_date
    market_segment = case_when(
    unit_price > 600  ~ "Premium",
    unit_price > 150  ~ "Mid-Range",
    unit_price <= 150 ~ "Budget"
  ))

# Summary of sales by market segment
segment_summary <- sales_analysis |>
  summarise(
    total_revenue = sum(quantity * unit_price),
    avg_order_size = mean(quantity),
    transaction_count = n(),
    .by = market_segment
  ) |>
  arrange(desc(total_revenue)) 
head(high_value_segment)
# A tibble: 6 × 7
  order_id product      quantity unit_price order_date          purchase_address
  <chr>    <chr>           <dbl>      <dbl> <dttm>              <chr>           
1 176565   Macbook Pro…        1      1700  2019-04-24 10:38:00 915 Willow St, …
2 176581   iPhone              1       700  2019-04-09 21:38:00 84 Jackson St, …
3 176608   iPhone              1       700  2019-04-11 12:01:00 15 Cherry St, S…
4 176633   ThinkPad La…        1      1000. 2019-04-23 14:03:00 863 Hickory St,…
5 176639   Macbook Pro…        1      1700  2019-04-28 16:14:00 853 Cedar St, S…
6 176643   Macbook Pro…        1      1700  2019-04-27 21:32:00 373 Adams St, B…
# ℹ 1 more variable: city <chr>
head(sales_analysis)
# A tibble: 6 × 9
  order_id product      quantity unit_price order_date          purchase_address
  <chr>    <chr>           <dbl>      <dbl> <dttm>              <chr>           
1 176558   USB-C Charg…        2       12.0 2019-04-19 08:46:00 917 1st St, Dal…
2 176559   Bose SoundS…        1      100.0 2019-04-07 22:30:00 682 Chestnut St…
3 176560   Google Phone        1      600   2019-04-12 14:38:00 669 Spruce St, …
4 176560   Wired Headp…        1       12.0 2019-04-12 14:38:00 669 Spruce St, …
5 176561   Wired Headp…        1       12.0 2019-04-30 09:27:00 333 8th St, Los…
6 176562   USB-C Charg…        1       12.0 2019-04-29 13:03:00 381 Wilson St, …
# ℹ 3 more variables: city <chr>, month <ord>, market_segment <chr>
head(segment_summary)
# A tibble: 3 × 4
  market_segment total_revenue avg_order_size transaction_count
  <chr>                  <dbl>          <dbl>             <int>
1 Premium            16961859.           1.00             15694
2 Mid-Range          11169956.           1.00             26113
3 Budget              6360222.           1.16            144143

0.4.3 Refining the Data and Categorization

To keep analysis simple, I used:

  • Column selection helpers (starts_with() and contains())
  • Conditional mutation if_else() for binary flags

This created an order_type variable to separate normal purchases from “Multi-Unit” bulk orders – important for understanding buying patterns.

# 1. select() with helper functions 
# I select the ID and any column related to the 'order' or 'product'
refined_view <- sales_analysis |>
  select(order_id, starts_with("order"), contains("product"))

# 2. mutate() with if_else() 
# Creating a binary flag to identify "Multi-Unit" orders 
sales_analysis <- sales_analysis |>
  mutate(order_type = if_else(quantity > 1, "Multi-Unit", "Single-Unit"))

# 3. arrange() in ascending order 
# Checking our lowest-priced items first
budget_check <- sales_analysis |>
  arrange(unit_price) |>
  head(10)
head(refined_view)
# A tibble: 6 × 3
  order_id order_date          product                   
  <chr>    <dttm>              <chr>                     
1 176558   2019-04-19 08:46:00 USB-C Charging Cable      
2 176559   2019-04-07 22:30:00 Bose SoundSport Headphones
3 176560   2019-04-12 14:38:00 Google Phone              
4 176560   2019-04-12 14:38:00 Wired Headphones          
5 176561   2019-04-30 09:27:00 Wired Headphones          
6 176562   2019-04-29 13:03:00 USB-C Charging Cable      
head(sales_analysis)
# A tibble: 6 × 10
  order_id product      quantity unit_price order_date          purchase_address
  <chr>    <chr>           <dbl>      <dbl> <dttm>              <chr>           
1 176558   USB-C Charg…        2       12.0 2019-04-19 08:46:00 917 1st St, Dal…
2 176559   Bose SoundS…        1      100.0 2019-04-07 22:30:00 682 Chestnut St…
3 176560   Google Phone        1      600   2019-04-12 14:38:00 669 Spruce St, …
4 176560   Wired Headp…        1       12.0 2019-04-12 14:38:00 669 Spruce St, …
5 176561   Wired Headp…        1       12.0 2019-04-30 09:27:00 333 8th St, Los…
6 176562   USB-C Charg…        1       12.0 2019-04-29 13:03:00 381 Wilson St, …
# ℹ 4 more variables: city <chr>, month <ord>, market_segment <chr>,
#   order_type <chr>
head(budget_check)
# A tibble: 6 × 10
  order_id product      quantity unit_price order_date          purchase_address
  <chr>    <chr>           <dbl>      <dbl> <dttm>              <chr>           
1 176575   AAA Batteri…        1       2.99 2019-04-27 00:30:00 433 Hill St, Ne…
2 176583   AAA Batteri…        2       2.99 2019-04-20 12:00:00 146 Jackson St,…
3 176586   AAA Batteri…        2       2.99 2019-04-10 17:00:00 365 Center St, …
4 176598   AAA Batteri…        2       2.99 2019-04-21 20:33:00 294 Park St, Sa…
5 176607   AAA Batteri…        2       2.99 2019-04-18 12:11:00 689 10th St, Bo…
6 176610   AAA Batteri…        1       2.99 2019-04-26 00:40:00 382 River St, D…
# ℹ 4 more variables: city <chr>, month <ord>, market_segment <chr>,
#   order_type <chr>

0.4.4 Advanced Reshaping: The City-Segment Matrix

I use a pivot operation to transform the data from a long format into a wide, readable matrix. This cross-tabulation provides a clear comparative view of how revenue is distributed across different cities and market segments simultaneously. This view is particularly useful for identifying which geographic regions have the highest concentration of “Premium” shoppers.

# Pivot Operation
# This transforms the data from 'Long' to 'Wide' format for a summary report
city_segment_matrix <- sales_analysis |>
  summarise(revenue = sum(quantity * unit_price), .by = c(city, market_segment)) |>
  pivot_wider(names_from = market_segment, values_from = revenue)

# Display the result
city_segment_matrix
# A tibble: 9 × 4
  city            Budget `Mid-Range`  Premium
  <chr>            <dbl>       <dbl>    <dbl>
1 Dallas         497339.     912940. 1357697.
2 Boston         668830.    1204417. 1788396.
3 Los Angeles   1011927.    1740350. 2700294.
4 San Francisco 1534063.    2671251. 4056890.
5 Seattle        501730.     925629. 1320397.
6 Atlanta        509482.     903920. 1382096.
7 New York City  859720.    1511003. 2293594.
8 Portland       435602.     722992. 1161897.
9 Austin         341530.     577454.  900598.

0.4.5 Advanced Filtering & Top/Bottom Slicing

To demonstrate precise data extraction, I used string matching and set membership (%in%) to isolate premium “Pro” models in specific tech-heavy cities. I also utilized slice_max() to identify the top 5 most profitable transactions, providing a snapshot of the store’s most successful individual sales.

# Identifying high-end 'Pro' or 'ThinkPad' models in key tech hubs
tech_hub_premium <- sales_analysis |>
  filter(
    city %in% c("San Francisco", "Seattle", "Austin"),
    str_detect(product, "Pro|ThinkPad")
  )

# Slice_max (Top 5 most profitable individual orders)
top_5_orders <- sales_analysis |>
  mutate(revenue = quantity * unit_price) |>
  slice_max(revenue, n = 5)
head(tech_hub_premium)
# A tibble: 6 × 10
  order_id product      quantity unit_price order_date          purchase_address
  <chr>    <chr>           <dbl>      <dbl> <dttm>              <chr>           
1 176565   Macbook Pro…        1      1700  2019-04-24 10:38:00 915 Willow St, …
2 176639   Macbook Pro…        1      1700  2019-04-28 16:14:00 853 Cedar St, S…
3 176681   ThinkPad La…        1      1000. 2019-04-20 10:39:00 331 Cherry St, …
4 176747   Macbook Pro…        1      1700  2019-04-27 15:04:00 645 7th St, San…
5 176749   ThinkPad La…        1      1000. 2019-04-26 07:55:00 322 Adams St, S…
6 176806   Macbook Pro…        1      1700  2019-04-22 13:26:00 197 Cherry St, …
# ℹ 4 more variables: city <chr>, month <ord>, market_segment <chr>,
#   order_type <chr>
head(top_5_orders)
# A tibble: 6 × 11
  order_id product      quantity unit_price order_date          purchase_address
  <chr>    <chr>           <dbl>      <dbl> <dttm>              <chr>           
1 181069   Macbook Pro…        2      1700  2019-04-27 21:01:00 668 Park St, Sa…
2 181544   Macbook Pro…        2      1700  2019-04-22 12:48:00 731 11th St, Ne…
3 210292   Macbook Pro…        2      1700  2019-06-08 09:00:00 953 Ridge St, S…
4 200528   Macbook Pro…        2      1700  2019-05-13 13:40:00 643 4th St, Bos…
5 149611   ThinkPad La…        2      1000. 2019-01-31 17:47:00 850 6th St, New…
6 278637   ThinkPad La…        2      1000. 2019-10-02 16:06:00 643 Cedar St, B…
# ℹ 5 more variables: city <chr>, month <ord>, market_segment <chr>,
#   order_type <chr>, revenue <dbl>

0.4.6 Window Functions & Time Series

I implemented the window functions cumsum() and lag() to analyze the store’s growth trajectory. By calculating cumulative revenue and the change in revenue from one order to the next, I can track the store’s momentum and identify points where revenue significantly outperformed the previous period.

# Revenue per order timestamp, running total, and change vs the previous order
daily_growth <- sales_analysis |>
  summarise(daily_rev = sum(quantity * unit_price), .by = order_date) |>
  arrange(order_date) |>
  mutate(
    cumulative_revenue = cumsum(daily_rev),
    prev_day_rev = lag(daily_rev),
    rev_change = daily_rev - prev_day_rev
  )
head(daily_growth)
# A tibble: 6 × 5
  order_date          daily_rev cumulative_revenue prev_day_rev rev_change
  <dttm>                  <dbl>              <dbl>        <dbl>      <dbl>
1 2019-01-01 03:07:00     12.0                12.0        NA       NA     
2 2019-01-01 03:40:00     12.0                23.9        12.0     -0.0400
3 2019-01-01 04:56:00    150                 174.         12.0    138.    
4 2019-01-01 05:53:00      2.99              177.        150     -147.    
5 2019-01-01 06:03:00     23.9               201.          2.99    20.9   
6 2019-01-01 06:13:00    700                 901.         23.9    676.    
sum(is.na(daily_growth))
[1] 2

There are two NA values (the first row has no previous order for lag() to reference), so I supply a default of 0.

# Replacing the leading NA values by giving lag() a default of 0
daily_growth <- sales_analysis |>
  summarise(daily_rev = sum(quantity * unit_price), .by = order_date) |>
  arrange(order_date) |>
  mutate(
    cumulative_revenue = cumsum(daily_rev),
    prev_day_rev = lag(daily_rev, default = 0),
    rev_change = daily_rev - prev_day_rev
  )
head(daily_growth)
# A tibble: 6 × 5
  order_date          daily_rev cumulative_revenue prev_day_rev rev_change
  <dttm>                  <dbl>              <dbl>        <dbl>      <dbl>
1 2019-01-01 03:07:00     12.0                12.0         0       12.0   
2 2019-01-01 03:40:00     12.0                23.9        12.0     -0.0400
3 2019-01-01 04:56:00    150                 174.         12.0    138.    
4 2019-01-01 05:53:00      2.99              177.        150     -147.    
5 2019-01-01 06:03:00     23.9               201.          2.99    20.9   
6 2019-01-01 06:13:00    700                 901.         23.9    676.    
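The summaries above group by the raw order_date timestamp, so each row is an order event rather than a calendar day. A minimal sketch of a true per-day roll-up, flooring the timestamps with lubridate::as_date() first; the toy tibble here stands in for sales_analysis:

```r
# A true per-day roll-up: floor the order timestamps to calendar days
# before summarising. The toy tibble stands in for sales_analysis.
library(dplyr)
library(lubridate)

toy <- tibble(
  order_date = ymd_hms(c("2019-01-01 03:07:00",
                         "2019-01-01 23:40:00",
                         "2019-01-02 09:15:00")),
  quantity   = c(1, 2, 1),
  unit_price = c(11.95, 11.95, 150)
)

true_daily <- toy |>
  mutate(order_day = as_date(order_date)) |>
  summarise(daily_rev = sum(quantity * unit_price), .by = order_day) |>
  arrange(order_day) |>
  mutate(cumulative_revenue = cumsum(daily_rev))

true_daily  # one row per calendar day, not per order
```

With this flooring in place, lag() and cumsum() behave as genuinely daily statistics.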

0.4.7 Correlation & Stats

Finally, I performed a correlation analysis to examine the relationship between price and quantity. This helps determine if higher price points lead to a significant drop in the number of units sold. I also used a cross-tabulation to count how order types (Single vs. Multi-Unit) are distributed across the different market segments.

# Checking if people buy fewer items when the price is higher
price_quantity_corr <- sales_analysis |>
  summarise(correlation = cor(unit_price, quantity))

# Breakdown of order types by market segment
segment_counts <- sales_analysis |>
  count(market_segment, order_type)
head(price_quantity_corr)
# A tibble: 1 × 1
  correlation
        <dbl>
1      -0.148
head(segment_counts)
# A tibble: 6 × 3
  market_segment order_type       n
  <chr>          <chr>        <int>
1 Budget         Multi-Unit   17324
2 Budget         Single-Unit 126819
3 Mid-Range      Multi-Unit      61
4 Mid-Range      Single-Unit  26052
5 Premium        Multi-Unit      13
6 Premium        Single-Unit  15681
Note

The calculated correlation between unit price and quantity is -0.148. While this confirms a negative relationship—supporting the law of demand—the correlation is weak. This is primarily because consumer electronics are discrete, low-quantity purchases; the data shows that whether a customer is buying a budget charging cable or a premium laptop, the modal (most common) quantity remains one unit.
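The modal-quantity claim can be checked directly with a count-and-rank pattern. A minimal sketch, run here on a toy tibble standing in for sales_analysis:

```r
# Checking the modal (most common) quantity within each market segment.
# The toy tibble stands in for sales_analysis.
library(dplyr)

toy <- tibble(
  market_segment = c("Budget", "Budget", "Budget", "Premium", "Premium"),
  quantity       = c(1, 1, 3, 1, 1)
)

modal_qty <- toy |>
  count(market_segment, quantity) |>        # frequency of each quantity
  slice_max(n, n = 1, by = market_segment)  # keep the most frequent row

modal_qty
```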

0.5 Section 5: Visualization

0.5.1 Distribution of Pricing (Boxplot)

Due to the significant skewness in product pricing—ranging from $2.99 batteries to $1,700 MacBooks—a standard linear histogram hides the distribution of premium goods. I chose a boxplot to compare the price spread of high-value products against budget accessories more effectively.

# 1. Boxplot
p1 <- ggplot(sales_analysis, aes(x = market_segment, y = unit_price, fill = market_segment)) +
  geom_boxplot(alpha = 0.7) +
  scale_y_continuous(labels = label_dollar()) +
  labs(title = "Price Distribution by Market Segment",
       subtitle = "Visualizing the spread and outliers across tiers",
       x = "Market Segment", y = "Unit Price") +
  theme_minimal() +
  scale_fill_viridis_d(option = "mako")

p1
Figure 1: Distribution of unit prices across market segments.
Note

The Question: Is there a big difference between “Budget” and “Premium” prices?

The Finding: Budget items are priced very closely together, while Premium items (like MacBooks) span a wide price range. This supports the $600 “High-Value” threshold chosen earlier.
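If a histogram is still wanted, a log10 price axis is one alternative that keeps the $2.99 to $1,700 range readable. A sketch using a small toy data frame rather than the full dataset:

```r
# A log10 price axis spreads the $2.99-$1,700 range more evenly than a
# linear scale. Toy prices stand in for sales_analysis$unit_price.
library(ggplot2)
library(scales)

toy_prices <- data.frame(unit_price = c(2.99, 11.95, 14.95, 150, 600, 1700))

p_log <- ggplot(toy_prices, aes(x = unit_price)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  scale_x_log10(labels = label_dollar()) +
  labs(title = "Unit Price Distribution (log scale)",
       x = "Unit Price (log10)", y = "Count") +
  theme_minimal()
```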

0.5.2 Revenue by City & Segment (Bivariate & Grouped)

This bar chart shows which cities generate the most revenue, broken down by market segment.

# 2. Bar Chart
p2 <- sales_analysis |>
  summarise(total_rev = sum(quantity * unit_price), .by = c(city, market_segment)) |>
  ggplot(aes(x = reorder(city, total_rev), y = total_rev, fill = market_segment)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = label_comma(accuracy = 1, scale = 1e-6, suffix = "M")(total_rev)), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_y_continuous(labels = label_dollar(scale = 1e-6, suffix = "M")) +
  scale_fill_viridis_d(option = "magma", end = 0.8) +
  labs(title = "Total Revenue by City and Market Segment",
       x = "City", y = "Total Revenue ($ Millions)", fill = "Segment") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2
Figure 2: Total revenue by city and market segment.
Note

The Question: Which cities spend the most on expensive electronics?

The Finding: Big tech cities like San Francisco and NYC are the clear leaders. They don’t just buy more items; they specifically buy more “Premium” items compared to smaller cities.

0.5.3 Correlation: Price vs. Quantity (Scatter Plot)

To visualize the -0.148 correlation calculated in the Correlation & Stats section, I used a scatter plot with a trend line.

# 3. Scatter Plot 
p3 <- ggplot(sales_analysis |> sample_n(5000), aes(x = unit_price, y = quantity)) +
  geom_jitter(alpha = 0.2, color = "#3498db", size = 1) + # Size/Alpha mapping
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Trend line
  labs(title = "Correlation: Price vs. Quantity Ordered",
       subtitle = "Sample of 5,000 orders showing weak negative correlation",
       x = "Unit Price ($)", y = "Quantity") +
  theme_minimal()
p3
`geom_smooth()` using formula = 'y ~ x'
Figure 3: Correlation between unit price and quantity.
Note

The Question: Do people buy fewer items when the price goes up?

The Finding: Yes, but only slightly. The downward trend line indicates that customers usually buy expensive laptops one at a time, but may buy several cheap cables at once.
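To check that an estimate like -0.148 is not just noise, base R's cor.test() reports a p-value alongside the coefficient. A sketch on simulated toy vectors standing in for the real price and quantity columns:

```r
# cor.test() reports a p-value alongside the Pearson coefficient.
# Simulated toy vectors stand in for the real price/quantity columns:
# cheap items are bought in variable quantities, expensive ones singly.
set.seed(42)
price <- c(rep(11.95, 50), rep(600, 25), rep(1700, 25))
qty   <- c(sample(1:4, 50, replace = TRUE), rep(1, 50))

ct <- cor.test(price, qty)
ct$estimate  # Pearson r (negative here by construction)
ct$p.value   # significance of the correlation
```

With ~186,000 real rows, even a weak correlation of this size would be highly significant.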

0.5.5 Dashboard (Patchwork)

I combined the key insights into a single dashboard to serve as an executive summary.

# 6. Combined Dashboard 
library(patchwork)

dashboard <- (p1 + p3) / p2 + 
  plot_annotation(
    title = "Electronics Store Annual Sales Dashboard",
    subtitle = "Analysis of 180k+ transactions from 2019",
    caption = "Data Source: Kaggle Electronics Sales Dataset"
  )

dashboard
`geom_smooth()` using formula = 'y ~ x'
Figure 5: Combined Dashboard

0.5.6 Seasonal Heatmap of Orders

# 7. Heatmap
p5 <- sales_analysis |>
  # First, create a 'day_of_week' column
  mutate(day_of_week = wday(order_date, label = TRUE)) |>
  count(month, day_of_week) |>
  ggplot(aes(x = month, y = day_of_week, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "plasma") +
  labs(title = "Order Volume Heatmap",
       subtitle = "Identifying peak shopping periods by day and month",
       x = "Month", y = "Day of Week", fill = "Orders") +
  theme_minimal()
p5
Figure 6: Seasonal Heatmap of Orders.
Note

The Question: What are the busiest days and months for orders?

The Finding: This map highlights “hot spots” in December. It tells a business manager exactly when they need more staff in the warehouse to handle the extra orders.
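The peak cell in the heatmap can also be extracted programmatically rather than read off by eye. A sketch using toy counts standing in for the count(month, day_of_week) table:

```r
# Extracting the single busiest month/day cell from heatmap-style counts.
# The toy tibble stands in for the count(month, day_of_week) table.
library(dplyr)

toy_counts <- tibble(
  month       = c("Nov", "Dec", "Dec"),
  day_of_week = c("Tue", "Tue", "Sat"),
  n           = c(900, 1200, 1100)
)

peak <- toy_counts |> slice_max(n, n = 1)  # row with the highest count
peak
```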

0.5.7 Price vs. Quantity by Segment (Multivariate Scatter)

Instead of sampling 5,000 orders, I examine the mean relationship for every product, using color to distinguish the segments.

# 8. Advanced Scatter 
p6 <- sales_analysis |>
  summarise(
    avg_price = mean(unit_price),
    avg_qty = mean(quantity),
    .by = c(product, market_segment)
  ) |>
  ggplot(aes(x = avg_price, y = avg_qty, color = market_segment)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_text(aes(label = product), check_overlap = TRUE, vjust = -1, size = 3) +
  scale_color_viridis_d() +
  labs(title = "Product Level: Price vs. Average Quantity",
       x = "Average Unit Price", y = "Average Quantity per Order") +
  theme_classic()
p6
Figure 7: Price vs. Quantity by Segment.
Note

The Question: Which specific products are the “Best Sellers” in each category?

The Finding: By looking at individual products, we can see which ones are expensive but still very popular. This helps us decide which items to put on the front page of the website.
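One way to surface those best sellers is slice_max() on revenue within each segment. A sketch on toy product rows standing in for sales_analysis:

```r
# Ranking "best sellers" by total revenue within each segment.
# Toy product rows stand in for sales_analysis.
library(dplyr)

toy <- tibble(
  product        = c("Cable", "Batteries", "Macbook", "ThinkPad"),
  market_segment = c("Budget", "Budget", "Premium", "Premium"),
  revenue        = c(20000, 15000, 800000, 400000)
)

best_sellers <- toy |>
  slice_max(revenue, n = 1, by = market_segment)  # top product per tier

best_sellers
```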

0.5.8 Daily Revenue Volatility (Time Series)

Using the daily_growth data created in Section 4, I can plot the daily change in revenue to show how “stable” the business is.

# 9. Line Chart 
p7 <- daily_growth |>
  ggplot(aes(x = order_date, y = rev_change)) +
  geom_line(color = "grey") +
  geom_point(aes(color = rev_change > 0), size = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Daily Revenue Volatility",
       subtitle = "Points above the red line indicate growth vs. previous day",
       x = "Date", y = "Revenue Change ($)") +
  theme_minimal() +
  theme(legend.position = "none")
p7
Figure 8: Daily Revenue Volatility.
Note

The Question: How much does revenue change from one day to the next?

The Finding: This chart shows “volatility.” If the dots are far from the red line, it means sales are jumping up and down. This helps us see if our daily promotions are actually working.
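When the series is this noisy, a moving average helps separate signal from day-to-day jitter. A base-R sketch using stats::filter() on toy revenue changes standing in for daily_growth$rev_change:

```r
# Smoothing day-to-day revenue changes with a centred 7-point moving
# average, base R only. Toy values stand in for daily_growth$rev_change.
rev_change <- c(12, -0.04, 138, -147, 20.9, 676, -500, 55, 10, -3, 200, -90)

roll_mean <- stats::filter(rev_change, rep(1 / 7, 7), sides = 2)
roll_mean  # NA at the edges, where no full 7-point window exists
```

The smoothed series could be overlaid on p7 with an extra geom_line() layer.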

0.5.9 Cumulative Growth Curve

This final plot shows the “Story of the Year”—how the store reached its total annual revenue.

# 10. Area Chart
p8 <- daily_growth |>
  ggplot(aes(x = order_date, y = cumulative_revenue)) +
  geom_area(fill = "steelblue", alpha = 0.4) +
  geom_line(color = "steelblue", linewidth = 1) +
  scale_y_continuous(labels = label_dollar(scale = 1e-6, suffix = "M")) +
  labs(title = "Annual Cumulative Revenue Growth",
       x = "Date", y = "Total Revenue ($ Millions)") +
  theme_minimal()
p8
Figure 9: Cumulative Growth Curve.
Note

The Question: How did the store’s total revenue grow over the whole year?

The Finding: This is the “Big Picture.” It shows that the store started slowly and finished the year with strong growth, reaching a multi-million dollar cumulative total by December.
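A useful companion statistic for this curve is the date on which it crosses half of the annual total. A base-R sketch on toy daily revenues standing in for daily_growth:

```r
# Finding the first day the cumulative curve crosses half of the annual
# total. Toy daily revenues stand in for daily_growth.
daily_rev <- c(100, 120, 90, 300, 250, 500, 640)
dates     <- as.Date("2019-01-01") + 0:6

cum_rev <- cumsum(daily_rev)
halfway <- dates[which(cum_rev >= sum(daily_rev) / 2)[1]]
halfway  # first day past 50% of annual revenue
```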

0.6 Section 6: Interactive Elements

0.6.1 Interactive Data Exploration (DT)

Instead of a static table, I have provided an interactive data browser. This allows you to filter by city, search for specific products, or sort by revenue to find hidden patterns.

# Make sure library(DT) is loaded
library(DT)

table_data <- sales_analysis |>
  summarise(
    Total_Revenue = sum(quantity * unit_price),
    Units_Sold = sum(quantity),
    Avg_Price = mean(unit_price),
    .by = c(city, product, market_segment)
  )

datatable(
  table_data,
  extensions = "Buttons", 
  filter = 'top',         
  options = list(
    dom = 'Bfrtip',       
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'), # Download options
    pageLength = 10, 
    autoWidth = TRUE
  ),
  colnames = c("City", "Product", "Segment", "Total Revenue", "Units Sold", "Avg Price"),
  caption = 'Table 1: Interactive Sales Summary by Geography and Product Tier'
) |> 
  formatCurrency(c('Total_Revenue', 'Avg_Price')) |>
  formatRound('Units_Sold', 0)
Figure 10: Interactive sales table with city/product revenue summary (filter, sort, download).
Note

Research Insight: Data Portability

The Question: How can managers use this data for their own reports?

The Finding: By adding CSV and Excel buttons, I turn a static report into a functional business tool. Stakeholders can filter for a specific city or product and instantly download that specific data subset.

Tip

I have implemented the dom = ‘Bfrtip’ configuration. This ensures that the Buttons (B), filter input (f), processing display (r), table (t), information summary (i), and pagination controls (p) all load in a clean, professional layout.

Important

Exporting Filtered Data

The download buttons are “reactive”. If you use the filters at the top of the columns to show only “Premium” items, clicking the Excel button will download only those filtered premium rows, not the entire dataset.

0.6.2 Interactive City Performance (ggplotly)

I have converted the Section 5.2 bar chart into an interactive version. You can now hover over any bar to see the exact dollar amount and the specific market segment.

# Convert ggplot to ggplotly
library(plotly)

# Using the previous p2 object but adding a 'text' aesthetic for custom hover
p2_interactive <- sales_analysis |>
  summarise(total_rev = sum(quantity * unit_price), .by = c(city, market_segment)) |>
  mutate(hover_text = paste("City:", city, 
                            "<br>Segment:", market_segment, 
                            "<br>Revenue: $", format(round(total_rev, 0), big.mark = ","))) |>
  ggplot(aes(x = reorder(city, total_rev), y = total_rev, fill = market_segment, text = hover_text)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  labs(title = "Hover to Explore Revenue by City", x = "City", y = "Total Revenue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to interactive
ggplotly(p2_interactive, tooltip = "text")
Figure 11: Interactive Bar Chart.
Note

Research Insight: Geographic Revenue

The Question: Which specific segment is the primary driver for our top-earning cities?

The Finding: Hovering over the bars reveals that in San Francisco, the “Premium” segment contributes more than double the revenue of the “Budget” and “Mid-Range” segments combined.

Tip

Navigation Tip

You can double-click a segment in the legend to isolate that specific tier across all cities, or click and drag to zoom into a specific group of cities.

Important

Comparative Analysis

Note that the y-axis represents total revenue. While some bars look similar in height, the hover text provides the exact dollar precision needed for financial auditing.

0.6.3 Product Value Analysis (Native plot_ly)

This plot is built natively with plot_ly() rather than converted from ggplot2 via ggplotly(). It visualizes the relationship between a product’s average price and its total revenue contribution.

# Native plotly with plot_ly()
# Custom hover text with 3+ variables
plot_ly(
  data = table_data,
  x = ~Avg_Price,
  y = ~Total_Revenue,
  color = ~market_segment,
  type = 'scatter',
  mode = 'markers',
  marker = list(size = 10, opacity = 0.6),
  text = ~paste("Product:", product, 
                "<br>City:", city, 
                "<br>Total Revenue: $", format(Total_Revenue, big.mark = ",")),
  hoverinfo = 'text'
) |>
  layout(
    title = "Interactive Product Performance Matrix",
    xaxis = list(title = "Average Unit Price ($)"),
    yaxis = list(title = "Total Revenue ($)")
  )
Figure 12: Product price vs revenue by market segment (hover for details).
Note

Research Insight: Value Matrix

The Question: Is there a clear “break point” where price starts to impact total revenue?

The Finding: This scatter plot shows a “revenue cluster” for items priced between $600 and $1,700, indicating that our “Premium” strategy is successfully capturing high-value transactions.

Tip

Interactive Discovery

Hover over the outliers on the far right of the x-axis to identify the specific high-end products that drive our top-tier revenue despite having lower sales frequency.

Important

Multi-Variable Context

The hover text for each point displays three key variables: Product Name, City, and Total Revenue, allowing a three-dimensional understanding of performance in a single view.

0.7 Section 7: Conclusion

Question 1: Does product pricing significantly impact customer purchasing volume?

Yes. My analysis confirms a weak negative correlation of -0.148 between unit price and quantity. As shown in the scatter plot (Figure 3), high-value items are almost exclusively purchased as single units, whereas budget items drive higher multi-unit transaction volume.

Question 2: Which geographic regions are the primary drivers for premium revenue?

The data shows that coastal tech hubs are the dominant force. According to the city revenue bar chart (Figure 2), San Francisco leads with a total revenue of $8,262,204, driven largely by the Premium segment. Smaller inland cities show significantly lower “willingness to pay” for high-end hardware.

Question 3: How does seasonality affect different market tiers?

Seasonality is most pronounced in the Premium tier. While the Budget segment remains stable, the faceted line chart (Figure 4) reveals a massive revenue spike in the Premium segment during the final quarter, likely due to holiday gifting and year-end corporate spending.

Geographic Targeting: Reallocate 20% of the marketing budget from lower-performing cities like Austin or Portland toward San Francisco and NYC specifically for Premium product promotions.

Inventory Optimization: Implement a “Just-in-Time” inventory model for Budget accessories (cables/batteries) during Q1-Q3 to reduce holding costs, as demand is highly predictable and flat.

Bundling Strategy: Since the correlation between price and quantity is weak (-0.148), the store should bundle Premium items with Budget accessories (e.g., a MacBook + USB-C Hub) to increase the “Units Per Transaction” in the high-value segment.

While the analysis is robust, it has the following limitations:

Time Constraint: The dataset is limited to a single year (2019), which prevents me from distinguishing between “Yearly Growth” and “Standard Seasonality.”

Missing Demographics: We lack customer age or gender data, meaning we cannot build detailed persona-based marketing strategies.

External Factors: The analysis does not account for external market benchmarks or competitor pricing, which likely influenced the observed sales trends.

Methodology Reflection: Choosing a Boxplot over a Histogram for Section 5 was a critical choice. The extreme price skewness made traditional distribution plots unreadable, and the boxplot allowed for a much cleaner comparison of price “spread” across tiers.

Future studies should incorporate Customer Lifetime Value (CLV) modeling. By joining this sales data with customer support or return data, we could identify which segments are not only high-revenue but also high-profit after accounting for post-sale costs.

0.8 AI Usage Log

I used Gemini 2.0 Flash as a coding assistant throughout this project to help with technical challenges and document formatting. At the start, I asked Gemini to help me understand specific R error messages, such as the “incomplete string” error in my inline code. I also asked Gemini to debug my custom function for calculating city revenue because I was having trouble with the function scope. When I moved to the visualization sections, I asked Gemini for syntax assistance to use the native pipe |> and the .by argument instead of group_by().

To improve the report’s structure, I asked Gemini to explain how to use Quarto features like callout blocks and tabsets. For the final bonus section, I asked Gemini to help me make my presentation more beautiful by suggesting an AI-themed layout with columns. Every time the AI provided code, I modified the labels and variable names to match my sales_analysis dataset and tested each chunk to ensure it ran without errors. I made sure to manually check all the mathematical outputs, like the San Francisco revenue total, against my raw data to ensure accuracy before final rendering.

1 Presentation

1.1 RevealJS Presentation Structure

1.3 Slide 2: Business Objectives

1.3.1 Core Research Pillars

  • Primary Goal: Identify high-value geographic hubs to optimize marketing spend.
  • Market Dynamics: Analyze price elasticity and volume trends across diverse product segments.
  • Seasonality: Determine peak periods for premium hardware acquisition.

1.4 Slide 3: Data Methodology


  • Scale: Processed 185,000+ transaction rows from 12 monthly datasets.
  • Cleaning: Handled missing values and standardized data types for time-series analysis.
  • Feature Engineering: Segmented products into Budget, Mid-Range, and Premium based on unit price distribution.

1.5 Slide 4: Key Finding 1: Geographic Revenue Leaders

p2

  • Highlight: San Francisco is the clear leader, generating $8,262,204.

1.6 Slide 5: The “Tech-Hub” Effect

  • Concentration: Revenue is heavily centered in coastal hubs (SF, NYC, Seattle).
  • Segment Mix: These cities show a significantly higher volume of Premium sales (MacBooks, iPhones) compared to inland counterparts.
  • Insight: Regional economic profiles directly correlate with “Willingness to Pay” for high-end electronics.

1.7 Slide 6: Key Finding 2: Price vs. Volume Correlation

Correlation Scatter Plot

p3
`geom_smooth()` using formula = 'y ~ x'

  • Statistic: A negative correlation of -0.148 confirms high price sensitivity for non-premium items.

1.8 Slide 7: Volume vs. Value Strategy

  • The Pareto Principle: While Budget items drive transaction frequency, Premium items drive total revenue.

  • Cash Flow: Budget accessories provide consistent, predictable volume.

  • Profitability: High-ticket items represent the primary engine for margin growth despite lower unit counts.

1.9 Slide 8: The Q4 Seasonality Surge

p4

  • Holiday Spike: Premium segments experience an aggressive revenue surge in December.

  • Budget Stability: Low-cost accessories remain relatively flat throughout the year.

  • Conclusion: Marketing efforts for high-margin goods must be maximized during the Q4 window.

1.10 Slide 9: Strategic Recommendation: Reallocation

  • Action: Reallocate 20% of the marketing budget toward San Francisco and NYC.

  • Logic: Target demographics with established premium purchase patterns to maximize Return on Ad Spend (ROAS).

  • Efficiency: Focus spend where “conversion-to-revenue” ratios are highest.

1.11 Slide 10: Strategic Recommendation: Bundling

  • Action: Bundle Premium hardware with high-margin Budget accessories (e.g., MacBook + USB-C Hub).

  • Logic: Offset the negative price-volume correlation by increasing Units Per Transaction (UPT).

  • Customer Value: Enhance the out-of-the-box experience while boosting average order value.

1.12 Slide 11: Future Research & Conclusion

  • Summary: Success is driven by geographic targeting and understanding segment-specific seasonality.

  • Next Steps: Implement Customer Lifetime Value (CLV) modeling to identify high-profit segments over time.

  • Expansion: Incorporate competitor pricing data for 2026 market forecasting.